EDIT: My bad, I missed the 14th tutorial, sorry about that...
Hey there!
First, thank you so much for your machine learning tutorials. I've already done two other tutorials and I could barely understand what they were talking about... In your case, it's clear, you explain everything, even the basic stuff, so thank you!
I just have an issue with part 18 of your tutorial: when I want to launch the code, I get this error:
Traceback (most recent call last):
  File "C:\Users\briche\Documents\Python Scripts\ProgrammingPython.net\K-Nearest Neighbors\18th course - Testing K-NN Classifier.py", line 32, in <module>
    df.drop(['id'], 1, inplace = True)
  File "C:\Users\briche\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1907, in drop
    new_axis = axis.drop(labels, errors=errors)
  File "C:\Users\briche\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 3262, in drop
    labels[mask])
ValueError: labels ['id'] not contained in axis
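For context, here is a minimal snippet (my own sketch, not from the tutorial, with made-up sample rows) that reproduces the same kind of error when the file has no header row:

```python
import io
import pandas as pd

# A tiny CSV with no header row: with header=None, pandas labels the
# columns 0, 1, 2, so there is no column literally named 'id'.
csv_text = "1000025,5,2\n1002945,5,4\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.columns.tolist())  # columns are just 0, 1, 2

try:
    df.drop(['id'], axis=1)  # same kind of call my script makes on line 32
except (KeyError, ValueError) as e:
    # older pandas raises ValueError here, newer pandas raises KeyError
    print('dropping id failed:', e)
```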
I think it's because I don't have any headers in my dataset, and that I could fix this by adding headers in the txt file, but I just wanted to know if another way to fix it is possible?
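In case it helps to show what I mean: here is a sketch of adding the headers in code instead of editing the txt file, using `names=` in `read_csv` (the sample rows and the shortened column list are just placeholders I made up, not the real 11-column file):

```python
import io
import pandas as pd

# Hypothetical stand-in for two rows of the header-less data file;
# the real file has 11 columns, these are only the first three.
csv_text = "1000025,5,2\n1002945,5,4\n"

# Passing names= tells read_csv there is no header row and labels the
# columns ourselves, so df.drop(['id'], ...) can then find 'id'.
df = pd.read_csv(io.StringIO(csv_text), names=['id', 'clump_thickness', 'class'])
df.drop(['id'], axis=1, inplace=True)
print(df.columns.tolist())  # ['clump_thickness', 'class']
```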
Here's my code:
<pre class='prettyprint lang-py'>
import numpy as np
from math import sqrt
import warnings
from collections import Counter
import pandas as pd
import random

def k_nearest_neighbors(data, predict, k = 3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    # for each class in data, so here, k and r
    for group in data:
        for features in data[group]:
            # faster way to calculate Euclidean distance than the hard-coded algorithm
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])

    # Remember, : means "to" when in front of a list, and "from" when at the end.
    # Here we're saying that after we sorted the distances, we only care about the distances up to k
    votes = [i[1] for i in sorted(distances)[:k]]
    # most_common comes as a list of a tuple, so we take [0] first and we get the tuple,
    # and then we take [0] again because that tuple tells you the most common group,
    # and then how many there were (didn't understand, but okay)
    print(Counter(votes).most_common(1))
    vote_result = Counter(votes).most_common(1)[0][0]

    return vote_result

df = pd.read_csv('https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data')
df.replace('?', -99999, inplace = True)
df.drop(['id'], 1, inplace = True)
# Next line is because if you just keep df, some of the values come as quotes
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)

# Next line means that we multiply test_size by len(full_data) to create an index value
# that will be cast to an int. So here, train_data will be the first 80% of the data,
# and test_data the remaining 20%
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

# Here, we're going to append lists into the train_set list, and that list is elements
# up to the last element. Here, we populate our dictionary
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0

for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy:', correct/total)
</pre>
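And for anyone else confused by the same `most_common` line I commented on above, here is a quick standalone sketch of what that double `[0][0]` indexing does:

```python
from collections import Counter

votes = ['r', 'r', 'k']
# most_common(1) returns a list holding one (element, count) tuple
print(Counter(votes).most_common(1))  # [('r', 2)]
# the first [0] takes that tuple, the second [0] takes the group itself
vote_result = Counter(votes).most_common(1)[0][0]
print(vote_result)  # r
```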
And the dataset I'm using : https://github.com/Jamy4000/machine_learning_tutorials/blob/master/ProgrammingPython.net/K-Nearest%20Neighbors/breast-cancer-wisconsin.data.txt
Thank you for your time !